Neural nets are known to be universal approximators. In particular, formal neurons implementing
wavelets have been shown to build nets able to approximate any multidimensional task. Such very 
specialized formal neurons may be, however, difficult to obtain biologically and/or industrially. 
In this paper we relax the constraint of a strict "Fourier analysis" of tasks. Rather, we use a 
finite number of more realistic formal neurons implementing elementary tasks such as "window" or 
"Mexican hat" responses, with adjustable widths. This is shown to provide a reasonably efficient, 
practical and robust, multifrequency analysis. A training algorithm, optimizing the task with respect 
to the widths of the responses, reveals two distinct training modes. The first mode induces some of 
the formal neurons to become identical, hence promotes "derivative tasks". The other mode keeps 
the formal neurons distinct. 



I. INTRODUCTION 
The ability of neural nets to be universal approximators has been proved by [...] and studied by further authors in different contexts. For instance, neurons or small neuronal groups implementing "plane wave responses" have been considered by [...] and [...]. Likewise, pairs of neurons implementing "windows" have been investigated by [...]. Any "complete enough" basis of functions, able to span a sufficiently large vector space of response functions, is of interest and, for instance, wavelet analysis has been the subject of a complete investigation by [...] and [...].

In this paper, we revisit the linear reconstruction of tasks, but with an emphasis upon neglecting the usual "translational" parameters. We mainly use a scale parameter only. This is somewhat different from the usual wavelet approach, which takes advantage of both translation and scale. But we shall find that a multifrequency reconstruction of tasks occurs as well. Simultaneously, we separate a "radial" from an "angular" analysis of the task. Finally, for the sake of robustness and biological relevance, we introduce a significant amount of randomness, corrected by training, in the choice of the implemented neuronal parameters. Furthermore, our basic neuronal units can be those "window-like" pairs advocated earlier [...], again because of biological relevance. Such deviations from the more rigorous approaches of [...] and [...] are expected to make the practical implementation of such neural nets cheaper.

We also investigate two training operations. The first one consists in a trivial optimization of the output synaptic layer connecting a layer of intermediate, "elementary task neurons" to an output, purely linear neuron. The second training consists in optimizing the scale parameters of such a layer of intermediate neurons. It will be found that one may start from random values of such parameters and nevertheless sometimes reach solutions where some of the intermediate neurons are driven to become identical. This "dynamical identification" training will be discussed.

In Section II we describe our formalism, including a traditional universality theorem. We also reduce the realistic, multi-dimensional situations to a one-dimensional problem. In Section III we illustrate such considerations by numerical examples of network training. Finally, Section IV contains our discussion and conclusion.

II. FORMALISM 
A. Definitions, architecture 

Consider an input $X > 0$ which must be processed into an output (a task) $F(X)$. This input is here taken to be a positive number, such as the intensity of a spike or the average intensity (or frequency) of a spike train. One may view $X$ as a "radial" coordinate in a suitable space. There is no loss of generality in restricting $X$ to be a positive number because, should negative values of $X$ be necessary for the argument, $F$ could always be split into even and odd parts, $[F(X) \pm F(-X)]/2$, respectively. Such even and odd parts obviously need only be known for $X > 0$. Outputs, in turn, will have both signs, in order to account for both excitation and inhibition. Finally, there is no need to distinguish a scalar task $F(X)$ from a vector task $\{F_1, F_2, \ldots\}$, since any component $F_k$ boils down to a separate scalar task, and this can be processed by a parallel architecture.

Consider now neuronal units which, for instance, may be excitatory-inhibitory pairs of neurons providing a window-like elementary response. Or they may be more complicated assemblies of neurons, providing a more elaborate "mother wavelet", such as a "Mexican hat". We denote by $f(X)$ the response function of such a unit and, for short, call this unit a "formal neuron" (FN). The traditional wavelet approach uses a set of such FN's with various thresholds $b$ and scale sensitivities $\lambda$, hence a space of elementary responses $f[(X - b)/\lambda]$. The same approach expands $F$ in this set,

$F(X) = \int db\, d\lambda\; w(b, \lambda)\, f[(X - b)/\lambda]$, (1)

where the integral is most often reduced to a discrete sum. Also, $b$ and $\lambda$ do not need to be independent parameters. The expansion coefficients, $w(b, \lambda)$, are output synaptic weights and are the unknowns of the problem. This well known architecture is shown in Figure 1.
FIG. 1. All elementary units (FN's) receive the same input $X$. Each unit returns an output $f$, which depends on parameters such as a threshold $b$ and a gain $\lambda$. Output synaptic weights $w(b, \lambda)$ linearly mix such elementary outputs $f(X; b, \lambda)$ into a global output $F(X)$.



B. One-dimensional universality, radial case 



The following, seemingly poorer, but simpler expansion,

$F(X) = \int_0^\infty d\lambda\; w(\lambda)\, \lambda^{-1} f(X/\lambda)$, (2)

does not use the translation parameter $b$. Here it is assumed that there exists a suitable electronic or biological tuning mechanism, able to recruit or adjust FN's with suitable gains $\lambda^{-1}$, but no threshold tuning. Such gains are positive numbers, naturally. The outputs of such FN's are then added, via synaptic output efficiencies $w(\lambda)$, which can be positive or negative, namely excitatory or inhibitory, respectively. The coefficient $\lambda^{-1}$ is introduced in Eq. (2) for convenience only. It can be absorbed in $w(\lambda)$.

This expansion, Eq. (2), allows a universality theorem. Define $Y = \ln X$ and $L = \ln \lambda$. The same expansion becomes,

$G(Y) \equiv F(e^Y) = \int_{-\infty}^{\infty} dL\; W(L)\, g(Y - L)$, (3)

where $W(L) = w(e^L)$ and $g(Y) = f(e^Y)$. This reduces the "scale expansion", Eq. (2), to a "translational expansion" where a basis is generated by arbitrary translations of a given function. The solution of this inverse convolution problem is trivially known as $\tilde{W}(p) = \tilde{G}(p)/\tilde{g}(p)$, where the tilde refers to the Fourier transforms of $W$, $G$ and $g$, respectively, and $p$ is the relevant "momentum". This result supports our claim of universality. In the following, this paper empirically assumes that the needed analytical properties of $f$, ..., $\tilde{W}$ are satisfied. Actually, for the sake of biological or industrial relevance, we are only concerned with discretizations of Eq. (2), with $N$ units,

$F_{app}(X) = \sum_{i=1}^{N} w_i\, f(X/\lambda_i)$, (4)

where we now let $w_i$ include the coefficient $\lambda_i^{-1}\, d\lambda$.
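
The change of variables between the scale expansion, Eq. (2), and the translational expansion, Eq. (3), can be checked numerically. The following is a minimal sketch, not part of the original study: the weight function `weight` (a Gaussian in $\ln\lambda$), the integration window and the test inputs are arbitrary assumptions, and the elementary task is taken to be a Mexican hat.

```python
import numpy as np

def mexican_hat(x):
    """Reference elementary task f(x) = (1 - x^2) exp(-x^2 / 2)."""
    return (1.0 - x**2) * np.exp(-0.5 * x**2)

def trapezoid(y, x):
    """Plain trapezoidal rule on a possibly non-uniform grid."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def weight(lam):
    """Hypothetical smooth weight w(lambda): a Gaussian in L = ln(lambda)."""
    return np.exp(-0.5 * np.log(lam) ** 2)

def task_scale_form(X, lam):
    """Eq. (2): integral over lambda of w(lambda) f(X / lambda) / lambda."""
    return trapezoid(weight(lam) * mexican_hat(X / lam) / lam, lam)

def task_log_form(X, L):
    """Eq. (3): integral over L of W(L) g(Y - L), W(L) = w(e^L), g(Y) = f(e^Y)."""
    Y = np.log(X)
    return trapezoid(weight(np.exp(L)) * mexican_hat(np.exp(Y - L)), L)

# Assumed integration window in L; the Gaussian weight is negligible outside it.
L_grid = np.linspace(-8.0, 8.0, 4001)
lam_grid = np.exp(L_grid)

X_values = np.array([0.3, 1.0, 2.5])
scale_values = np.array([task_scale_form(X, lam_grid) for X in X_values])
log_values = np.array([task_log_form(X, L_grid) for X in X_values])
```

Both quadratures approximate the same integral, so `scale_values` and `log_values` should agree up to discretization error.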



C. Rotational analysis 



Obviously, input patterns to be processed by a net cannot be reduced to one degree of freedom $X$ only. Rather, they consist of a vector $\mathbf{X}$ with many components $X_1, X_2, \ldots, X_P$. These may be considered as, and recoded into, a radial variable $X = \sqrt{\sum_{n=1}^{P} X_n^2}$ and, to specify a direction on the suitable hypersphere, $(P - 1)$ angles $\alpha_1, \alpha_2, \ldots, \alpha_{P-1}$. Enough special functions (Legendre polynomials, spherical harmonics, rotation matrices, etc.) are available to generate complete functional bases in angular space, and one might invoke some formal neurons as implementing such base angular functions. The design of such FN's, as well as the design of such a polar coordinate recoding, is somewhat far-fetched, though. In this paper we prefer to take advantage of the following argument, based upon the synaptic weights of the input layer, shown in Figure 2.
FIG. 2. Architecture showing how a task can be rotated by means of the input synaptic weights.



In the left part of Fig. 2, all the FN's have the same input synaptic weights $\mathbf{u} = \{u_1, u_2, \ldots, u_P\}$, hence receive the same input $X = \mathbf{u} \cdot \mathbf{X}$ when contributing to a global task $F$. For the right part of Fig. 2 it is again assumed that all FN's have equal input weights, with, however, weights $\mathbf{u}'$ deduced from $\mathbf{u}$ by a pure rotation, $\mathbf{u}' = \mathcal{R}\mathbf{u}$. Accordingly, if the output weights of the left part are the same as those of the right one, the global task $F(\mathbf{u}' \cdot \mathbf{X})$ performed by the right part is a rotated task, $F' = \mathcal{R}F$. An expansion of any task $\mathcal{T}$ upon the $(P - 1)$-rotation group is thus available,

$\mathcal{T} = \int d\mathcal{R}\; W(\mathcal{R})\, \mathcal{R}F$, (5)

where discretizations are in order, naturally, with suitable output weights $W$. Here $F$ plays the role of an elementary task, and it might be of some interest to study cases where $F$ belongs to specific representations of the rotation group. This broad subject exceeds the scope of the present paper, however, and, in the following, we restrict our considerations to scalar tasks $F(X)$ of a scalar input $X$, according to Fig. 1 only.
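
The rotation argument can be illustrated in two dimensions. The sketch below is an illustration only: the scalar task `task`, the rotation angle and the random inputs are assumptions. It checks that rotating the input weights, $\mathbf{u}' = \mathcal{R}\mathbf{u}$, gives the same output as feeding the unrotated net with inversely rotated inputs, since $(\mathcal{R}\mathbf{u}) \cdot \mathbf{X} = \mathbf{u} \cdot (\mathcal{R}^T \mathbf{X})$.

```python
import numpy as np

def task(x):
    """Hypothetical scalar task F acting on the projected input u . X."""
    return np.tanh(x) * np.exp(-0.1 * x**2)

def rotation_2d(angle):
    """2x2 rotation matrix for the given angle (radians)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

u = np.array([1.0, 0.0])          # common input weights of the left part
R = rotation_2d(0.7)              # assumed rotation
u_rot = R @ u                     # input weights of the right part

rng = np.random.default_rng(0)
inputs = rng.normal(size=(100, 2))           # each row is an input vector X

right_part = task(inputs @ u_rot)            # F(u' . X)
left_on_rotated = task((inputs @ R) @ u)     # F(u . R^T X)
```

The two arrays coincide, which is the sense in which the right part of Fig. 2 performs a rotated copy of the task of the left part.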

D. Training output weights 

Let us return to Eq. (4), in an obvious, short notation $F_{app} = \sum_i w_i f_i$. Two kinds of parameters can be used to best reconstruct $F$: the output synaptic weights $w_i$ and, hidden inside the elementary tasks $f_i$, the scales $\lambda_i$. Let $(\,|\,)$ denote a suitable scalar product in the functional space spanned by all the $f_i$'s of interest. We assume, naturally, that the same scalar product makes sense for the $F$'s. Incidentally, there is no loss of generality if $F$ is normalized, $(F|F) = 1$, since the final neuron is linear.

One way to define the "best" $F_{app}$ is to minimize the square norm of the error $(F - F_{app})$. In terms of the $w_i$'s, this consists in solving the equations,

$\frac{\partial}{\partial w_i} \left[ (F|F) - 2 \sum_{j=1}^{N} w_j (f_j|F) + \sum_{j,k=1}^{N} w_j (f_j|f_k) w_k \right] = 0, \quad i = 1, \ldots, N$. (6)

Let $Q$ be the matrix with elements $Q_{jk} = (f_j|f_k)$. Its inverse $Q^{-1}$ usually exists. Even in those rare cases when $Q$ is very ill-conditioned, or its rank is lower than $N$, it is easy to define a pseudoinverse such that, in all cases, the operator $\mathcal{P} = \sum_{i,j=1}^{N} |f_i)\, (Q^{-1})_{ij}\, (f_j|$ is the projector upon the subspace spanned by the $f_i$'s. Then a trivial solution, $F_{app} = \mathcal{P}F$, is found for Eqs. (6),

$w_i = \sum_{j=1}^{N} (Q^{-1})_{ij}\, (f_j|F), \quad i = 1, \ldots, N$. (7)

Given $F$ and the $f_i$'s, this projection, which can be achieved by elementary trainings of the output layer of synaptic weights, will be understood in the following. It makes the $w_i$'s functions of the $\lambda_j$'s.
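
The projection step can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the code used for the results of this paper: the scalar product is discretized on a uniform grid over an assumed range $[0, 40]$, the target `target` is an arbitrary smooth function, and the pseudoinverse is taken from `numpy.linalg.pinv`.

```python
import numpy as np

def mexican_hat(x):
    """Reference elementary task f(x) = (1 - x^2) exp(-x^2 / 2)."""
    return (1.0 - x**2) * np.exp(-0.5 * x**2)

def inner(a, b, dX):
    """Discretized scalar product (a|b) = integral dX a(X) b(X)."""
    return float(np.sum(a * b) * dX)

def optimal_weights(basis, target, dX):
    """Output weights w = Q^+ (f|F), with Q the matrix of overlaps (f_j|f_k)."""
    Q = np.array([[inner(fj, fk, dX) for fk in basis] for fj in basis])
    rhs = np.array([inner(fj, target, dX) for fj in basis])
    return np.linalg.pinv(Q) @ rhs, Q

X = np.linspace(0.0, 40.0, 4001)            # assumed integration range
dX = X[1] - X[0]
scales = np.array([0.25, 0.5, 1.0, 2.0, 4.0])
basis = np.array([mexican_hat(X / lam) for lam in scales])
target = np.exp(-X / 3.0) * np.sin(X)       # hypothetical target task F

w, Q = optimal_weights(basis, target, dX)
F_app = w @ basis                           # projection of F on the span of the f_i
residual = target - F_app
overlaps = np.array([inner(fj, residual, dX) for fj in basis])
```

At the optimum the residual $F - F_{app}$ is orthogonal to every $f_j$, so `overlaps` vanishes up to rounding, and the norm of `F_app` cannot exceed that of `target`.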

E. Training elementary tasks 

Now we are concerned with the choice of the parameters $\lambda_i$ of the FN's performing elementary tasks. This is of some importance, for the number $N$ of FN's in the intermediate layer is quite limited in practice. The subspace spanned by the $f_i$'s is thus highly undercomplete. Hence, every time one requests an approximator to a new $F$, an optimization with respect to the intermediate layer is in order, to patch likely weaknesses of the "projector" solution, Eqs. (7).

Let us again minimize the square norm $\mathcal{E} = (\,(F - F_{app})\,|\,(F - F_{app})\,)$ of the error. We know from Eqs. (7) that the $w_i$'s are functions of the $\lambda_j$'s, but there is no need to use chain rules $(\partial w_i/\partial \lambda_j)\, \partial/\partial w_i$, because the same equations, Eqs. (6), cancel the corresponding contributions, the $w_i$'s being optimal. Derivatives of the $f_i$'s with respect to their scales $\lambda_i$ are enough. The gradient of $\mathcal{E}$, to be cancelled, reads,

$\frac{\partial \mathcal{E}}{\partial \lambda_j} = \frac{2 w_j}{\lambda_j^2} \left( X f'(X/\lambda_j) \,\middle|\, (F - F_{app}) \right) = 0, \quad j = 1, \ldots, N$. (8)

Here $f'$ is the straight derivative of the reference elementary task, before any scaling. There is no difficulty in implementing a training algorithm for a gradient descent in the $\lambda$-space.
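
The statement that no chain rule through the $w_i$'s is needed can be checked by comparing the gradient formula, with prefactor $2w_j/\lambda_j^2$, against finite differences of the error in which the output weights are re-optimized at each displaced scale. The sketch below assumes a Mexican hat elementary task, a discretized scalar product on $[0, 40]$, and an arbitrary target; none of these choices is prescribed by the text.

```python
import numpy as np

def f(x):
    """Mexican hat, f(x) = (1 - x^2) exp(-x^2 / 2)."""
    return (1.0 - x**2) * np.exp(-0.5 * x**2)

def f_prime(x):
    """Derivative of the unscaled Mexican hat, f'(x) = (x^3 - 3x) exp(-x^2 / 2)."""
    return (x**3 - 3.0 * x) * np.exp(-0.5 * x**2)

def project(scales, X, dX, target):
    """Optimal output weights for fixed scales, and the sampled basis."""
    basis = np.array([f(X / lam) for lam in scales])
    Q = basis @ basis.T * dX
    w = np.linalg.pinv(Q) @ (basis @ target * dX)
    return w, basis

def error(scales, X, dX, target):
    """Square norm of F - F_app, with the weights re-optimized for these scales."""
    w, basis = project(scales, X, dX, target)
    r = target - w @ basis
    return float(np.sum(r * r) * dX)

def gradient(scales, X, dX, target):
    """dE/dlambda_j = (2 w_j / lambda_j^2) (X f'(X/lambda_j) | F - F_app)."""
    w, basis = project(scales, X, dX, target)
    r = target - w @ basis
    return np.array([2.0 * w[j] / lam**2 * np.sum(X * f_prime(X / lam) * r) * dX
                     for j, lam in enumerate(scales)])

X = np.linspace(0.0, 40.0, 4001)            # assumed integration range
dX = X[1] - X[0]
target = np.exp(-X / 3.0) * np.sin(X)       # hypothetical target task
scales = np.array([0.5, 1.0, 2.0])

analytic = gradient(scales, X, dX, target)
h = 1e-5                                    # finite-difference step
numeric = np.array([
    (error(scales + h * e, X, dX, target) - error(scales - h * e, X, dX, target)) / (2 * h)
    for e in np.eye(len(scales))])
```

The agreement between `analytic` and `numeric` reflects the fact that the weights are stationary: their implicit dependence on the scales contributes nothing at first order.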

The next section, Sec. III, gives a brief sample of the results we obtained when solving Eqs. (6) and (8) for many choices of the global task $F$ and elementary task $f$.



III. NUMERICAL ILLUSTRATIVE EXAMPLES 



A. Symmetry and degeneracy 



Define for instance the scalar product in the functional space as $(f_i|f_j) = \int dX\, f_i(X)\, f_j(X)$, $\forall f_i, f_j$. Among many numerical tests we show here the results obtained when the target task reads $F(X) = 0.10167\, e^{-X/10}\, \{0.60717 \tanh[4(X - 1.66133)] - 4.33575 \tanh[4(X - 9.56591)]\}$. Let the elementary task of a FN read $f(X/\lambda) = (1 - X^2/\lambda^2)\, e^{-X^2/(2\lambda^2)}$, a Mexican hat. Set $N = 5$, and initial values 1/4, 1/2, 1, 2, 4 for the $\lambda$'s. Keeping Eqs. (6) satisfied at each step, start a gradient descent from such initial values. Our increments of the $\lambda_i$'s at each step read $\delta\lambda_i = -2\, \partial\mathcal{E}/\partial\lambda_i$, see Eqs. (8). After $\sim 90$ steps, a saturation of $\|F_{app}\|^2 = (F|\mathcal{P}|F)$ begins, see Figure 3.
FIG. 3. Learning curve: the norm $\|F_{app}\|$ of $F_{app}$ increases as a function of the number $n$ of learning steps, then saturates.
A comparison between $F$ and $F_{app}$ is provided by Figure 4.




FIG. 4. A target task (solid line) and its best approximation (dashed) after learning saturation. 



This saturation makes it reasonable to interrupt the learning process. For the sake of rigor, however, another run, with 1000 steps, was used to verify the saturation. While the saturation is confirmed, the convergence of the $\lambda_i$'s turns out to be slower. The values of the $\lambda_i$'s and $w_i$'s at the end of this second run read {0.249, 0.535, 1.0512, 1.0522, 11.13} and {-0.0008, -0.0002, 38.107, -38.121, 0.3764}, respectively. The weakness of $w_1$ and $w_2$ is explained by the lack of a fine structure in $F$. The large and almost opposite values of $w_3$ and $w_4$ clearly mean a renormalization of $(f_3 - f_4)$, since $\lambda_3$ and $\lambda_4$ are so close to each other. Numerical difficulties linked to difference effects may therefore demand extra care in practical applications. We show in Figure 5 the way in which the $\lambda_i$'s evolved during the first 100 steps of the gradient descent.
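
A training loop of the kind described above might be sketched as follows. Several ingredients are assumptions of this sketch rather than details given in the text: the integration range $[0, 40]$, the grid, a step-halving safeguard that rejects any move increasing the error, and an ordering of the scales with a minimal spacing of $10^{-3}$. It is not claimed to reproduce the numbers quoted above.

```python
import numpy as np

def f(x):
    """Mexican hat elementary task."""
    return (1.0 - x**2) * np.exp(-0.5 * x**2)

def f_prime(x):
    """Derivative of the unscaled Mexican hat."""
    return (x**3 - 3.0 * x) * np.exp(-0.5 * x**2)

def target_task(X):
    """Target task F(X) quoted in the text."""
    return 0.10167 * np.exp(-X / 10.0) * (
        0.60717 * np.tanh(4.0 * (X - 1.66133))
        - 4.33575 * np.tanh(4.0 * (X - 9.56591)))

def fit(scales, X, dX, F):
    """Optimal weights, error square norm and its gradient for fixed scales."""
    basis = np.array([f(X / lam) for lam in scales])
    w = np.linalg.pinv(basis @ basis.T * dX) @ (basis @ F * dX)
    residual = F - w @ basis
    err = float(np.sum(residual**2) * dX)
    grad = np.array([2.0 * w[j] / lam**2 * np.sum(X * f_prime(X / lam) * residual) * dX
                     for j, lam in enumerate(scales)])
    return w, err, grad

def enforce_gap(scales, gap=1e-3):
    """Sort the scales and keep them positive and at least `gap` apart."""
    s = np.sort(scales)
    s[0] = max(s[0], gap)
    for i in range(1, len(s)):
        s[i] = max(s[i], s[i - 1] + gap)
    return s

def train(scales, X, dX, F, n_steps=100, step=2.0, max_halvings=30):
    """Gradient descent on the scales; a move is accepted only if the error does not grow."""
    scales = enforce_gap(np.asarray(scales, dtype=float))
    w, err, grad = fit(scales, X, dX, F)
    history = [err]
    for _ in range(n_steps):
        trial_step = step
        for _ in range(max_halvings):
            trial = enforce_gap(scales - trial_step * grad)
            w_t, err_t, grad_t = fit(trial, X, dX, F)
            if err_t <= err:
                scales, w, err, grad = trial, w_t, err_t, grad_t
                break
            trial_step *= 0.5            # assumed safeguard: halve the step and retry
        history.append(err)
    return scales, w, history

X = np.linspace(0.0, 40.0, 2001)         # assumed integration range
dX = X[1] - X[0]
scales, w, history = train([0.25, 0.5, 1.0, 2.0, 4.0], X, dX, target_task(X))
```

The list `history` records the error after each step and is non-increasing by construction. Inspecting `scales` and `w` at the end of a run shows whether some scales have approached each other, with large output weights of opposite signs.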




FIG. 5. Evolution of Log $\lambda_i$, $i = 1, \ldots, 5$, during learning, for one of the "derivative sensitivity cases". Notice the fusions of two pairs of scales, then the splitting of one of them.

A temporary merging of $\lambda_1$ with $\lambda_2$, then a final episode in which they become distinct again, are striking, as well as the merging of $\lambda_3$ with $\lambda_4$. It must be stressed at this stage that the whole process is invariant under any permutation of the $\lambda_i$'s (and of their associated $w_i$'s), hence a "triangular" rule, $\lambda_i < \lambda_{i+1}$, can be implemented without restricting learning flexibility. Furthermore, as a symmetric function under pairwise exchanges of such parameters, the error square norm $\mathcal{E}$ has a vanishing "transverse" derivative, $\partial\mathcal{E}/\partial(\lambda_i - \lambda_j) = 0$, every time $\lambda_i = \lambda_j$. It is thus not surprising that, at least for part of the learning process, the learning path rides lines where such parameters merge.

When merging occurs, the functional basis seems to degenerate since $f_i$ and $f_j$ are not distinct. It will be recalled, however, that our output neuron is linear, and nothing prevents the process from using the strictly equivalent representations, $w_i f_i + w_j f_j = (w_i + w_j)/2 \times (f_i + f_j) + (w_i - w_j)/2 \times (f_i - f_j)$. A trivial renormalization of the $(f_i - f_j)$ term ensures that the functional basis still contains two independent vectors, namely, a new elementary response $df/d\lambda$ besides $f_i = f_j$. Naturally, the renormalization has a numerical cost, since both $w_i$ and $w_j$ must diverge. In practice, a minute modification of the "triangular rule", which becomes, in our runs, $\lambda_{i+1} - \lambda_i > 10^{-3}$, is enough to smooth our calculations. The conclusion of this merging phenomenon, for those $F$'s where it occurs, is of some interest: new neuronal units (new FN's) may spontaneously emerge. These are "derivative sensitive", and may represent a new task $df/d\lambda$, or, if $(p + 1)$ parameters merge, any further derivative $d^p f/d\lambda^p$.
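
The emergence of a derivative response from two nearly merged units can be made concrete with the values quoted earlier for the third and fourth units. The sketch below assumes the Mexican hat elementary task and an arbitrary grid. It splits the pair into its sum and difference parts, then compares the difference part with the analytic derivative of $f(X/\lambda)$ with respect to $\lambda$, taken at the midpoint of the two scales.

```python
import numpy as np

def f(x):
    """Mexican hat elementary task."""
    return (1.0 - x**2) * np.exp(-0.5 * x**2)

def f_prime(x):
    """Derivative of the unscaled Mexican hat."""
    return (x**3 - 3.0 * x) * np.exp(-0.5 * x**2)

def df_dlambda(X, lam):
    """Analytic derivative of f(X / lambda) with respect to lambda."""
    return -(X / lam**2) * f_prime(X / lam)

X = np.linspace(0.0, 10.0, 1001)          # assumed grid
lam_i, lam_j = 1.0512, 1.0522             # nearly merged scales quoted in the text
w_i, w_j = 38.107, -38.121                # their output weights

pair = w_i * f(X / lam_i) + w_j * f(X / lam_j)
sum_term = 0.5 * (w_i + w_j) * (f(X / lam_i) + f(X / lam_j))
diff_term = 0.5 * (w_i - w_j) * (f(X / lam_i) - f(X / lam_j))

# Renormalized difference: a finite weight times the derivative response.
lam_mid = 0.5 * (lam_i + lam_j)
effective_weight = 0.5 * (w_i - w_j) * (lam_i - lam_j)
derivative_term = effective_weight * df_dlambda(X, lam_mid)
```

Here `effective_weight` stays moderate although the individual weights are large, which is the renormalization mentioned above.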



B. Full Symmetry Breaking 

Most choices of $F$ yield distinct values for the $\lambda_i$'s. We show in Figure 6 a trivial case. Here $f = 1/[1 + (X^2/\lambda^2)]$, a window-like elementary response, and the target task reads $F = 9/(1 + 16X^2) + 5/(1 + 4X^2) + 2/(1 + X^2) - 1/[1 + (X^2/4)] - 1/[1 + (X^2/16)]$, a sum of such windows. We freeze $\lambda_3 = 1$, $\lambda_4 = 2$, and $\lambda_5 = 4$, a symmetry breaking situation, and clearly a part of the obvious solution for the minimum of $\mathcal{E}$. Then the contour map of $\mathcal{E}$ in the $\{\lambda_1, \lambda_2\}$-space does show the expected minimum for $\lambda_1 = 1/4$ and $\lambda_2 = 1/2$. The minimum turns out to be very flat, hence some robustness is likely for that special case. The learning process does reach this "fully symmetry breaking" configuration, together with the corresponding set of $w_i$'s, namely {9, 5, 2, -1, -1}. Many other, less academic cases generate a full symmetry breaking, namely distinct $\lambda_i$'s.
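
For this academic case the output weights can be recovered directly once the scales sit at 1/4, 1/2, 1, 2, 4. The sketch below assumes a uniform grid on $[0, 20]$ and uses an ordinary least-squares fit of the sampled functions, which is equivalent to solving the weight equations with the discretized scalar product.

```python
import numpy as np

def window(X, lam):
    """Window-like elementary response 1 / [1 + (X / lambda)^2]."""
    return 1.0 / (1.0 + (X / lam) ** 2)

X = np.linspace(0.0, 20.0, 2001)             # assumed grid
scales = np.array([0.25, 0.5, 1.0, 2.0, 4.0])
basis = np.array([window(X, lam) for lam in scales])

target = (9.0 / (1.0 + 16.0 * X**2) + 5.0 / (1.0 + 4.0 * X**2)
          + 2.0 / (1.0 + X**2) - 1.0 / (1.0 + X**2 / 4.0)
          - 1.0 / (1.0 + X**2 / 16.0))

# Least squares on the grid: minimizes the discretized square norm of F - F_app.
w, *_ = np.linalg.lstsq(basis.T, target, rcond=None)
max_residual = float(np.max(np.abs(target - w @ basis)))
```

Since the target lies exactly in the span of the five windows, the fit returns the weights of the sum and a residual at rounding level.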




FIG. 6. Symmetry breaking case. Contours of the error in the vicinity of a symmetry breaking set of parameters. The last 
three out of five adjustable parameters are frozen and distinct. The minimum of the error is reached with unequal values of 
the first two parameters. 



C. More numerical results 



Besides "windows" and "Mexican hats", we also used oscillatory shapes such as $(\sin X)/X$ for $f$. A cut-off by an exponential decay was also sometimes introduced. The range of the scalar product integration was independently varied within one order of magnitude. Sometimes the dimension $N$ of the elementary task basis was also taken as a random number, a test of little interest, however, which just verified that $F_{app}$ improves when $N$ increases. For $F$, a few among our tests involved a small amount of random noise added to a smooth main part $F_{background}$. Furthermore we investigated a fair number of piecewise continuous $F$'s, this case being of interest for image processing [...]. Alternatively, we smoothed such discontinuities with a suitable definition of $F$, such as $F = \sum_t \alpha_t \tanh[\sigma(X - \beta_t)]$, with randomized choices of the number of terms, the coefficients $\alpha_t$, the large "slope coefficient" $\sigma$, and the positions $\beta_t$ of the steep areas. The set of initial values for the $\lambda_i$'s before gradient descent was also sometimes taken at random. It was often found that a traditional sequence $\lambda_i \sim 2^{\,i - (N+1)/2}$ is not a bad choice for a start.

All our runs converge reasonably smoothly to a saturation of the norm $\|F_{app}\|$, provided those cases where $Q$ becomes ill-conditioned are handled numerically. There is a significant proportion of runs where the optimum seems to be quite flat, hence some robustness of the results. Local minima where the learning gets trapped do not seem to occur very often, but this problem deserves the usual caution, with the usual annealing if necessary. We did not find clear criteria for predicting whether a given $F$ leads to a merging of some $\lambda_i$'s, however. Despite this failure, all these results make a reasonably positive case for the learning process described by Eqs. (6) and (8) and the emergence of "derivative tasks".
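
Test configurations of the kind used in this section can be generated along the following lines. The function names, the random number generator, the distributions of the coefficients and positions, and the default slope are assumptions made for this sketch. Only the geometric starting sequence of scales, in powers of 2 centered on 1, and the form of the smoothed target as a sum of tanh terms are taken from the text.

```python
import numpy as np

def initial_scales(N):
    """Geometric start lambda_i = 2**(i - (N + 1) / 2), for i = 1..N."""
    i = np.arange(1, N + 1)
    return 2.0 ** (i - (N + 1) / 2.0)

def random_smoothed_steps(X, rng, max_terms=4, slope=4.0):
    """Random sum of steep tanh steps: random count, coefficients and positions."""
    n_terms = int(rng.integers(1, max_terms + 1))
    coefficients = rng.normal(size=n_terms)                 # assumed distribution
    positions = rng.uniform(X.min(), X.max(), size=n_terms)
    return sum(a * np.tanh(slope * (X - b))
               for a, b in zip(coefficients, positions))

rng = np.random.default_rng(1)
X = np.linspace(0.0, 20.0, 501)
start = initial_scales(5)
target = random_smoothed_steps(X, rng)
```

For $N = 5$ the starting sequence reduces to the initial values 1/4, 1/2, 1, 2, 4 used in the Mexican hat example.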



IV. DISCUSSION AND CONCLUSION 

This paper tries to relate several issues. Most of them are well known in the theory of neural nets, but two of our considerations, the question of symmetries and the rotational analysis, might give reasonably original results, to our knowledge at least.

The most important and well known issue is that of the universality offered by nets whose architecture is described by Figures 1 and 2, namely four layers: input weights $\mathbf{u}$, FN's for elementary tasks $\mathbf{f}$ with adjustable parameters $\mathbf{M}$, output weights $\mathbf{w}$, linear output neuron(s). The linearity of the output(s) can be summarized in any dimension by the linear transform $\mathbf{F}(\mathbf{X}) = \int d\mathbf{M}\; \mathbf{w}(\mathbf{M})\, \mathbf{f}(\mathbf{X}; \mathbf{M})$. (We use here boldface symbols to stress that the linearity generalizes to any suitable vector and tensor situations for multiparameter inputs, intermediate tasks and outputs.) This linearity actually reduces the theory of such an architecture to a special case of the "generator coordinate" theory, well known in physics [...]. As well, from a mathematical point of view, this boils down to the sole question of the invertibility of the kernel $\mathbf{f}(\mathbf{X}; \mathbf{M})$. Actually, the invertibility problem amounts to identifying those classes of global tasks $\mathbf{F}$ which belong to the functional (sub)space spanned by the $\mathbf{f}$'s. For the sake of definiteness, we proved a universality theorem for the very special case of "scaling without translating", inspired by wavelets. But most of the considerations of this paper clearly hold if one replaces, mutatis mutandis, wavelets by other responses and scaling parameters by any other parameters.

The parameters $\mathbf{M}$ can be defined as including the input synaptic weight vectors $\mathbf{u}$, whose dimension is necessarily the same as that of the inputs $\mathbf{X}$ in order to generate the actual inputs $\mathbf{u} \cdot \mathbf{X}$ received by the intermediate FN's. When $\mathbf{M}$ also explicitly includes scale parameters $\lambda$, there is no loss of generality in restricting the $\mathbf{u}$'s to be unit vectors. Hence the linear kernel $\mathbf{f}$ can imply, in a natural way, an integration upon the group of rotations transforming all the $\mathbf{u}$'s into one another. This part of the theory relates to the angular momentum projections which are so familiar in the theory of molecular and nuclear rotational spectra [10].

The well known issue of the discretization of a continuous expansion converts kernels into finite matrices, naturally. This paper studied what happens if one trains $\mathbf{w}$ for a temporary optimum of the approximate task $\mathbf{F}_{app}$, while $\mathbf{M}$ is not yet optimized. This implies a prejudice on training speeds: $\mathbf{w}$ fast learner, $\mathbf{M}$ slower. Other choices, such as $\mathbf{w}$ slower learner and $\mathbf{M}$ faster, for instance, are as legitimate, and should be investigated too. The question is of importance for biological systems, because of obvious likely differences in the time behaviors and biochemical and metabolic factors of synapses and cell bodies. The training speed hierarchy we chose points to one technical problem only, namely whether the Gram matrix $Q$ of scalar products $(\mathbf{f}_i|\mathbf{f}_j)$ is easily invertible or not. We do not use a Gram-Schmidt orthogonalization of the finite basis of such $\mathbf{f}_i$'s, but the (pseudo) inversion of $Q$ amounts to the same. Once $Q^{-1}$ is obtained, temporarily optimal $\mathbf{w}$ are easily derived.

Our further optimization of $\mathbf{F}_{app}$ with respect to the parameters of the intermediate FN's takes advantage of the linearity of the output(s) and the symmetry of the problem under any permutation of the FN's. Let $i$ label such FN's, $i = 1, \ldots, N$, and denote by $\mathbf{M}_i$ the parameters of the $i$-th FN. We found cases where the gradient descent used to optimize $\mathbf{F}_{app}$ induces a few $\mathbf{M}_i$'s to become quite close to one another. Such functional clusters, because of the output linearity, may yield elementary tasks corresponding to derivatives of $\mathbf{f}$ with respect to components of $\mathbf{M}$. This derivative process may look similar to a Gram-Schmidt orthogonalization, but it is actually distinct, because no rank is lost in the basis. For those $\mathbf{F}$'s which induce such mergings of FN's, industrial applications should benefit from a preliminary simulation of training as a useful precaution, because, besides straight FN's implementing $\mathbf{f}$, additional, more specific FN's implementing "derivative $\mathbf{f}$'s" will be necessary. For biological systems, diversifications of neurons, or groups of such, between tasks and "derivative tasks" might also be concepts of interest. It may be noticed that the word "derivative" may hold with respect to inputs as well as parameters. Indeed, as found at the stage of Eq. (3), scale parameters reduce, in a suitable representation, to translational parameters in a task $g(Y - L)$. The sign difference between $dg/dY$ and $dg/dL$ is obviously inconsequential.

To conclude, this emergence of "derivative elementary tasks" leads us to a problem yet unsolved by our numerical studies with many different $\mathbf{F}$'s and many different $\mathbf{f}$'s: given the shape of $\mathbf{f}$, can one predict whether a given $\mathbf{F}$ leads to a full symmetry breaking or to a partial merging of the FN's?